fix: load WASM grammars sequentially to avoid Node 20+ race condition #40

Merged
colbymchenry merged 1 commit into colbymchenry:main from ravescovi:fix/sequential-grammar-loading on Feb 19, 2026

Conversation

@ravescovi

Summary

  • Replace `Promise.allSettled(entries.map(...))` with a sequential `for...of` loop in `initGrammars()` to avoid a known web-tree-sitter WASM race condition on Node.js 19+ (V8 10.8+)
  • When multiple grammars with external scanners (TypeScript, TSX, C#, Swift, Kotlin, Dart, etc.) are loaded concurrently, V8's WebAssembly runtime hits a symbol-resolution race in which one grammar's exports overwrite another's GOT entries
  • This produces errors like `bad export type for 'tree_sitter_tsx_external_scanner_create': undefined`, so those languages silently fail to index

Root Cause

web-tree-sitter WASM instantiation is not safe for concurrent Language.load() calls on Node.js 19+ (V8 10.8+). The external scanner symbols from one grammar can collide with another's during parallel initialization.
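
The change from concurrent to sequential loading can be sketched as follows. This is a minimal illustration, not the code from this PR: `GrammarEntry`, `LoadResult`, `loadGrammarsSequentially`, and the `loadOne` callback (a stand-in for the real `Language.load()` call) are hypothetical names. It assumes the loop should keep the `Promise.allSettled` behavior of recording a failure and moving on to the next grammar.

```typescript
// Hypothetical shape of one grammar entry; the real initGrammars() may differ.
type GrammarEntry = { name: string; wasmPath: string };

type LoadResult<T> =
  | { name: string; status: "fulfilled"; value: T }
  | { name: string; status: "rejected"; reason: unknown };

// Loads grammars one at a time. Each load is awaited before the next one
// starts, so two WASM modules are never instantiated concurrently.
async function loadGrammarsSequentially<T>(
  entries: GrammarEntry[],
  loadOne: (entry: GrammarEntry) => Promise<T>, // stand-in for Language.load()
): Promise<LoadResult<T>[]> {
  const results: LoadResult<T>[] = [];
  for (const entry of entries) {
    try {
      const value = await loadOne(entry);
      results.push({ name: entry.name, status: "fulfilled", value });
    } catch (reason) {
      // Like Promise.allSettled: record the failure and keep going.
      results.push({ name: entry.name, status: "rejected", reason });
    }
  }
  return results;
}
```

The trade-off is startup latency: total load time becomes the sum of the individual loads rather than the maximum.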

Documented upstream:

Test plan

  • Verified on Node.js v20.20.0 (Linux x86_64) — all 16 grammars load successfully after the fix
  • Before fix: only Python indexed (303 files). After fix: Python + TSX + TypeScript + JavaScript + JSX (408 files)
  • No `Failed to load` grammar warnings appear in the output after the change

web-tree-sitter has a known race condition when loading multiple WASM
grammars concurrently on Node.js 19+ (V8 10.8+). External scanner
symbols from one grammar can overwrite another's GOT entries, causing
"bad export type" errors for TypeScript, TSX, and other languages.

Replace Promise.allSettled(entries.map(...)) with a sequential for...of
loop so each grammar fully initializes before the next one starts.

Ref: tree-sitter/tree-sitter#2338
Owner

@colbymchenry colbymchenry left a comment

Amazing work! Thank you!

@colbymchenry colbymchenry merged commit 5d699ab into colbymchenry:main Feb 19, 2026
andreinknv added a commit to andreinknv/codegraph that referenced this pull request May 18, 2026
Final wave of the codegraph tool-audit friction sweep.

Polish:
- colbymchenry#25 compare_to_ref now reports body-only-edited files explicitly.
- colbymchenry#26 compare_to_ref includeEdges renders symbol names (not raw IDs),
      filters self-edges, drops empty file headers.
- colbymchenry#27 codegraph_coverage gains sources (list) + drop modes; the two
      audit-residue coverage sources were removed from the index.
- colbymchenry#28 role classifier ROLE_LIST_TEXT requires structural route/handler
      evidence — stops api_endpoint over-assignment from docstrings.
- colbymchenry#29 status topBiomarkers emits an explicit clean/0-findings line.
- colbymchenry#30 discover skips test-fixture indices (FIXTURE_DIR_NAMES).
- colbymchenry#31 CLI reload-modules warns it has no lasting effect (ephemeral).
- colbymchenry#32 session/note CLI --limit defaults aligned to MCP (20 / 50).
- colbymchenry#33 CLI ask renders the verified-citations block (shared
      buildCitationReport helper, reused from the MCP path).
- colbymchenry#34 codegraph_session gains a delete action + session delete CLI
      subcommand + deleteSession query helper.
- colbymchenry#40 fuzzy-fallback banner extended to coverage + role symbol modes.

Docs:
- colbymchenry#35 find intent-mode hint references codegraph_graph, not the
      removed codegraph_callees / codegraph_walk.
- colbymchenry#36 dead_code via=rule footer recommends via=llm.
- colbymchenry#37 sql read-only rejection message names both the MCP and CLI
      schema-flag forms (surface-neutral).
- colbymchenry#38 serve --no-write-tools help names real write-class tools.
- colbymchenry#39 CLI local-chat help says "local LLM" to match MCP.

Reviewer APPROVE; info findings (CLAUDE.md CLI docs, JSDoc wording)
addressed. Suite 3037 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mbenhamd referenced this pull request in mbenhamd/codegraph May 24, 2026
…rift/diff foundation) (#38)

* feat(PF-690): schema v6 + per-symbol fingerprint columns for duplicate/drift/diff infrastructure

First slice of the trace/duplicate/drift roadmap that Codex + agy
debated in the design RFC. Pure data infrastructure — no new CLI/MCP
surface yet. PR #39 (codegraph_diff), PR #40 (codegraph_duplicates),
and PR #41 (codegraph_explain) will consume these columns.

## What changed

- `src/extraction/fingerprints.ts` (new, ~245 lines): SHA-256 hashes
  computed from the in-memory tree-sitter subtree.
  - `astHash` (Type-1): normalized token stream with identifiers +
    literals preserved exactly. Detects "same code, only
    whitespace/comments differ".
  - `astShapeHash` (Type-2): identifier leaves in non-semantic
    positions replaced by `_ID`. Detects "same code, renamed
    locals". Property/field/type identifiers preserved by type;
    member-access targets, callees, kwarg names, type names,
    import names preserved by parent-context check.
  - `sigHash`: SHA-256 of the signature string. Null when no
    signature was extracted.
  - Comment + whitespace stripped; trivia tokens (commas,
    semicolons, braces) excluded via `namedChild` walk.

- Schema v6 migration (`src/db/migrations.ts:93-119`,
  `src/db/schema.sql`): adds 4 nullable columns to `nodes` table —
  `ast_hash`, `ast_shape_hash`, `sig_hash`, `call_pattern_hash`.
  Partial indexes on the two body hashes (`WHERE NOT NULL`) so
  duplicate-detection sweeps are O(log N) lookups instead of full
  scans. `callPatternHash` is reserved for post-resolution
  population by a later PR.

- Extraction wiring (`src/extraction/tree-sitter.ts:425-460`):
  computes the three body hashes from the already-parsed tree-sitter
  subtree inside `createNode`. agy's RFC point — that the AST is
  already in memory and hashing is microseconds — verified on a
  107-file codegraph src/ corpus: 2.4s with vs 2.9s without
  (overhead below run-to-run variance, well under Codex's ≤15%
  budget).

- `Node` interface (`src/types.ts:165-198`): adds nullable
  `astHash`, `astShapeHash`, `sigHash`, `callPatternHash` fields
  with provenance docstrings.

- `queries.ts` insertNode / updateNode / rowToNode: round-trip the
  fingerprint columns nullably so framework-extractor synthesized
  route nodes (no body) keep `null` fingerprints — downstream
  consumers filter with `WHERE ast_hash IS NOT NULL`.
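
The difference between the two body hashes in the first bullet can be illustrated on a toy tree. This is a sketch under stated assumptions, not the contents of `src/extraction/fingerprints.ts`: `ToyNode`, `tokens`, and `ret` are hypothetical, the tree is hand-built rather than parsed by tree-sitter, and the shape hash here renames every plain `identifier` leaf, ignoring the parent-context exceptions the commit message describes.

```typescript
import { createHash } from "node:crypto";

// Toy stand-in for a tree-sitter node; only named nodes are modeled.
type ToyNode = { type: string; text?: string; children?: ToyNode[] };

const sha256 = (s: string): string =>
  createHash("sha256").update(s).digest("hex");

// Flattens the tree into a token stream. Comments are skipped; when
// renameIds is true, plain identifier leaves become the _ID placeholder.
function tokens(node: ToyNode, renameIds: boolean, out: string[] = []): string[] {
  if (node.type === "comment") return out;
  const kids = node.children ?? [];
  if (kids.length === 0) {
    const rename = renameIds && node.type === "identifier";
    out.push(rename ? "_ID" : `${node.type}:${node.text ?? ""}`);
  } else {
    out.push(`(${node.type}`);
    for (const kid of kids) tokens(kid, renameIds, out);
    out.push(")");
  }
  return out;
}

// Type-1 style hash: identifiers and literals kept exactly.
const astHash = (n: ToyNode): string => sha256(tokens(n, false).join(" "));
// Type-2 style hash: identifiers replaced, literals still kept.
const astShapeHash = (n: ToyNode): string => sha256(tokens(n, true).join(" "));

// Builds `return <name> + <lit>`, optionally preceded by a comment.
function ret(name: string, lit: string, comment?: string): ToyNode {
  const expr: ToyNode = {
    type: "binary_expression",
    children: [
      { type: "identifier", text: name },
      { type: "number", text: lit },
    ],
  };
  const children: ToyNode[] = comment
    ? [{ type: "comment", text: comment }, expr]
    : [expr];
  return { type: "return_statement", children };
}
```

With these helpers, `ret("x", "1")` and `ret("y", "1")` share an `astShapeHash` but not an `astHash`, adding a comment changes neither hash, and changing the literal to `"2"` changes both.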

## v1 contract (Council RFC, locked by tests)

- Detects: Type-1 (whitespace/comment-insensitive clones),
  Type-2 (renamed-locals clones).
- Does NOT detect: Type-3 (statement reorder), Type-4 (semantic
  equivalence).
- Literal values preserved → security/config code where the literal
  matters does not falsely conflate. Strongest counterpoint the
  council named ("miss literal-only differs") explicitly accepted.

## Bug-pin verified during review

Codex pass 1 caught a real BLOCKER: `tree-sitter-python` parses
`obj.start()` as `attribute(identifier "obj", identifier "start")`
(both children are plain `identifier`), so a type-only rename rule
would have conflated `obj.start()` with `obj.stop()`. Fix: parent-
context check (`shouldPreserveIdentifier`) preserves identifiers
in semantic positions — `attribute` children, `call.function`
field, `keyword_argument` children, types, imports. Codex round 2
caught a follow-on: Python kwargs (`g(start=1)`) — added
`keyword_argument` to the semantic-parent set.
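
A minimal sketch of the parent-context check described above, assuming a hand-built toy tree in place of real `tree-sitter-python` output. The name `shouldPreserveIdentifier` comes from the commit message, but its signature here, the `ToyNode` type, the reduced `SEMANTIC_PARENT_TYPES` set, and the `shapeTokens` helper are illustrative guesses. Hashing is left out so the token streams can be compared directly.

```typescript
// Toy node: `field` is the name of the field this node fills in its parent.
type ToyNode = {
  type: string;
  text?: string;
  field?: string;
  children?: ToyNode[];
};

// Parent types under which a plain identifier carries meaning. Only the two
// Python cases named in the text are listed; the real set is larger.
const SEMANTIC_PARENT_TYPES = new Set(["attribute", "keyword_argument"]);

function shouldPreserveIdentifier(parent: ToyNode | null, field: string | null): boolean {
  if (parent === null) return false;
  if (SEMANTIC_PARENT_TYPES.has(parent.type)) return true;
  // A bare callee sits in the `function` field of a `call` node.
  return parent.type === "call" && field === "function";
}

// Shape token stream: identifiers become _ID unless the parent context
// marks them as semantic.
function shapeTokens(node: ToyNode, parent: ToyNode | null = null, out: string[] = []): string[] {
  const kids = node.children ?? [];
  if (kids.length === 0) {
    const rename =
      node.type === "identifier" &&
      !shouldPreserveIdentifier(parent, node.field ?? null);
    out.push(rename ? "_ID" : `${node.type}:${node.text ?? ""}`);
  } else {
    out.push(`(${node.type}`);
    for (const kid of kids) shapeTokens(kid, node, out);
    out.push(")");
  }
  return out;
}

// `obj.<member>()` in the shape the commit message reports for Python.
const methodCall = (member: string): ToyNode => ({
  type: "call",
  children: [
    {
      type: "attribute",
      field: "function",
      children: [
        { type: "identifier", field: "object", text: "obj" },
        { type: "identifier", field: "attribute", text: member },
      ],
    },
    { type: "argument_list", field: "arguments" },
  ],
});
```

Because the check is set membership on the parent type, the receiver `obj` is preserved along with the member name, which matches the over-preservation trade-off the reviewer trail mentions.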

## Tests (`__tests__/fingerprints.test.ts`, 14 cases)

- sigHash determinism + null on missing signature.
- Determinism: same input → same hex.
- Type-1: whitespace/comment edits preserve astHash.
- Type-2: renamed locals share astShapeHash, NOT astHash.
- Member rename diverges (TS `property_identifier` path).
- Literal change diverges (security sensitivity pinned).
- Control-flow reorder diverges (Type-3 NOT detected, pinned).
- Python regression: `obj.start()` vs `obj.stop()` diverge
  (member preserved despite both being `identifier`).
- Python bare callee: `start()` vs `stop()` diverge.
- Python kwarg: `g(start=1)` vs `g(stop=1)` diverge.
- Python param rename: same astShapeHash, different astHash.
- Cross-language: TS body ≠ Python body even when semantically
  equivalent.

## Reviewer trail

- Codex pass 1: 1 BLOCKER (Python member conflation) + 1 REVIEW
  (missing Python tests) + 1 NITPICK (stale comment).
- Codex round 2: BLOCKER + REVIEW CLOSED. New REVIEW (Python
  kwarg conflation) + NITPICK (header repeated stale claim).
- Codex round 3: Both round-2 findings CLOSED. Last NITPICK
  (kwarg comment misrepresented the over-preservation trade-off).
  Codex authorized "iterate for the comment fix, then ship".
- Doc comment now accurately describes the trade-off: kwarg
  set-membership preserves any direct identifier leaf including
  value-side identifiers; tighter field-specific check deferred
  to a follow-up.

## Verification

- tsc --noEmit clean
- npm test: 1026 passed | 2 skipped (was 1012 on main; +14
  fingerprint tests)
- npm run test:eval:structural: 8/8 PASS, recall=1.00 precision=1.00
  fp=0 (no regression vs main baseline)
- Index-time delta on 107-file corpus: 2.4s with vs 2.9s without —
  below run-to-run variance, well under ≤15% target.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(PF-690): cross-language member preservation + idempotent v6 migration (Codex round 4)

Codex round 4 deep sweep verified via real `tree-sitter-wasms` parses
that the v1 fingerprint rule conflated semantically different code
in Ruby, Java, C#, and Rust. The original rule only handled Python's
`attribute` shape + the `call.function` field. Each of these four
languages emits plain `identifier` for member/callee positions but
under DIFFERENT parent node types:

  - Ruby:    `user.start` -> call(identifier "user", identifier "start"),
             method field carries the member name (not `function`).
  - Java:    `obj.start()` -> method_invocation(identifier, identifier).
  - C#:      `obj.Start()` -> invocation_expression > member_access_expression(identifier, identifier).
  - Rust:    `Router::new()` -> call_expression > scoped_identifier(identifier "Router", identifier "new").

Fix: extend `SEMANTIC_PARENT_TYPES` with `method_invocation`,
`member_access_expression`, `invocation_expression`,
`scoped_identifier`, `scoped_call_expression`, `field_expression`.
Add `call.method` field check to `shouldPreserveIdentifier` to
cover Ruby's dual-purpose `call` type. Same set-membership v1
trade-off applies (accepts false negative on receiver names rather
than risk semantic-name conflation).

Plus Codex round 4 REVIEW: migration v6 was not idempotent under
concurrent-open race. Two processes hitting a v5 database could
both read version 5, both enter migration, and the second's
`ALTER TABLE ADD COLUMN` would fail with duplicate-column even
though the resulting schema is fine. Fixed via `PRAGMA table_info`
pre-check per column so already-applied additions become no-ops.
`CREATE INDEX IF NOT EXISTS` was already idempotent.
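
The per-column pre-check can be sketched as below. This is an illustration under assumptions, not the code in `src/db/migrations.ts`: `MinimalDb`, `addColumnIfMissing`, and `migrateToV6` are hypothetical names, and the `TEXT` column type is a guess. The interface lists only the `prepare(sql).all()` and `exec(sql)` calls, which both better-sqlite3 and `node:sqlite` provide.

```typescript
// Minimal database surface used by this sketch.
interface MinimalDb {
  prepare(sql: string): { all(): unknown[] };
  exec(sql: string): void;
}

// Adds a column only if PRAGMA table_info does not already list it.
// Returns true when the column was added, false when it was already there.
function addColumnIfMissing(
  db: MinimalDb,
  table: string,
  column: string,
  decl: string,
): boolean {
  const rows = db.prepare(`PRAGMA table_info(${table})`).all() as { name: string }[];
  if (rows.some((row) => row.name === column)) return false; // already applied
  db.exec(`ALTER TABLE ${table} ADD COLUMN ${column} ${decl}`);
  return true;
}

// Applies the four v6 columns named in the commit message; re-running it
// against an already-migrated table adds nothing.
function migrateToV6(db: MinimalDb): string[] {
  const added: string[] = [];
  const columns = ["ast_hash", "ast_shape_hash", "sig_hash", "call_pattern_hash"];
  for (const column of columns) {
    // Column type TEXT is an assumption made for this sketch.
    if (addColumnIfMissing(db, "nodes", column, "TEXT")) added.push(column);
  }
  return added;
}
```

The sketch covers only the pre-check; it does no locking of its own, so it does not address two processes that interleave between the check and the `ALTER TABLE`.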

Tests added (4 cross-language regressions):
- Ruby `user.start` vs `user.stop` -> different astShapeHash
- Java `obj.start()` vs `obj.stop()` -> different astShapeHash
- C# `obj.Start()` vs `obj.Stop()` -> different astShapeHash
- Rust `Router::new()` vs `Router::default()` -> different astShapeHash

Each pins the specific cross-language failure mode Codex verified.

Reviewer trail:
- Codex round 4 (deep sweep, 6 attack vectors): found 1 BLOCKER
  (cross-language conflation) + 1 REVIEW (migration race). Both
  fixed; remaining 4 vectors confirmed clean (hash determinism,
  persistence completeness, ERROR/MISSING handling, createNode
  hook coverage).
- CodeRabbit CLI: ran against the same diff, no findings.
- Claude Explore subagent: returned 7 findings; 3 already covered
  here (cross-language tests, migration safety), 4 deferred as
  documentation/contract clarifications (line-ending CRLF
  normalization, downstream kwarg trade-off doc, callPatternHash
  contract clarity, SQLite version compat — node:sqlite ships
  SQLite 3.42+ which fully supports partial indexes).

Verification:
- tsc --noEmit clean
- npm test: 1030 passed | 2 skipped (was 1026 last commit;
  +4 cross-language tests)
- npm run test:eval:structural: 8/8 PASS, recall=1.00 precision=1.00
  fp=0 (no regression vs baseline)
- All 18 fingerprint tests pass deterministically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>